Research of Metanprom Bank clients

A regional bank with branches in Yaroslavl and the regional cities of Rostov and Rybinsk wants to know how customers use its services.

The purpose of the study:

Research objectives:

Available information:

The research process:

  1. General Information Study.
  2. Data Preprocessing.
  3. Exploratory Data Analysis.
  4. User Segmentation.
  5. Hypothesis Testing.
  6. General Conclusion and Recommendations.

General Information Study

Column headers use inconsistent styles, and it is not always clear what data they describe.

The dataset contains numerical and categorical features, as well as categorical features encoded as numbers (for example, credit card ownership or loyalty). There are missing values in the Balance column.

The minimum salary of 11.58 rubles looks like an erroneous value.

Conclusions

There are 10,000 records in the dataset.

There are twelve columns in total:

Column names do not follow the accepted style. There are missing values in the Balance column. Some categorical features are encoded as numbers, which should be taken into account in further work with the data.
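The original headers are not shown in this report, so the header names below are hypothetical. A minimal sketch of how mixed-style headers could be normalised to snake_case with pandas:

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Convert a header such as 'CreditCard' or 'estimated salary' to snake_case."""
    # Replace runs of spaces or hyphens with a single underscore
    name = re.sub(r"[\s\-]+", "_", name.strip())
    # Insert an underscore at each lower-to-upper boundary: 'CreditCard' -> 'Credit_Card'
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()

# Hypothetical headers in mixed styles (not the bank's actual column names)
raw = pd.DataFrame(columns=["userid", "CreditCard", "Balance", "estimated_salary", "Churn"])
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

Passing the function to `rename` keeps the mapping in one place, so a new column in a later data export is handled without editing a hand-written dictionary.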

Data Preprocessing

There are missing values in the balance column. Perhaps these are churn clients. Let's check the hypothesis.
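One way to run this kind of check is to compare the share of missing balances between groups. The sketch below uses toy data and assumed column names (`balance`, `churn`); it is not the report's actual code.

```python
import numpy as np
import pandas as pd

def missing_share_by(df: pd.DataFrame, column: str, by: str) -> pd.Series:
    """Share of rows with a missing `column` within each group of `by`."""
    return df[column].isna().groupby(df[by]).mean()

# Toy data for illustration only; not the bank's records
toy = pd.DataFrame({
    "balance": [np.nan, 120_000.0, np.nan, 95_000.0, np.nan, 130_000.0],
    "churn":   [0, 0, 0, 1, 1, 1],
})
share = missing_share_by(toy, "balance", "churn")
print(share)
```

The same helper can be reused with `by="city"` to see whether the gaps are concentrated in particular branches.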

There are more missing values among non-churn clients. Let's see if there is a correlation between the balance and salary to fill in the missing values.

The balance does not correlate with salary. Let's check if the missing values are related to specific branches of the bank.

All the missing values in the balance column are in the Rybinsk and Yaroslavl branches. The reason may be that the bank's branch in Rostov worked only with salary clients.

We are leaving the missing values as they are.

The estimated_salary column has an abnormally low minimum. Let's look at the box-and-whisker diagram.

There are no outliers. The 25th percentile is 50,000 rubles. Let's look at the 2nd and 1st percentiles.

One percent of clients in the dataset have a salary estimated by the bank at 1,842 rubles or less. Perhaps this is due to the peculiarities of salary assessment or the fact that the client works part-time. We will delete records with a salary of less than 1,000 rubles: such a salary looks abnormal even for small Russian towns.
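The percentile check and the filtering step could look roughly like this. The data is a toy sample and the column name `estimated_salary` follows the text above; the 1,000-ruble threshold is the one chosen in this section.

```python
import pandas as pd

MIN_SALARY = 1000  # threshold in rubles, taken from the text above

# Toy salaries for illustration only
toy = pd.DataFrame(
    {"estimated_salary": [11.58, 850.0, 1842.0, 50_000.0, 99_000.0, 150_000.0]}
)

# Lower-tail percentiles help judge how unusual the smallest values are
low_tail = toy["estimated_salary"].quantile([0.01, 0.02])

# Keep only records at or above the threshold
filtered = toy[toy["estimated_salary"] >= MIN_SALARY].reset_index(drop=True)

print(low_tail)
print(len(toy) - len(filtered), "records removed")
```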

Conclusions

There are no duplicates in the data. The dataset was processed as follows:

There are 9,941 records left in the dataset after processing.

Exploratory Data Analysis

We look at the distribution of the following features: credit scoring points, age, number of objects owned, balance, salary.

Let's see how many products customers use.

The vast majority of clients use one (5,053 clients) or two (4,562 clients) products.

Let's look at the distribution of clients by cities where there are bank branches.

Half of the bank's clients are from Yaroslavl; Rostov and Rybinsk each account for approximately 25% of clients.

Let's look at the distribution of features such as gender, credit card availability, loyalty, churn (whether the client left or not).

Among the bank's clients, there are almost a thousand more men than women (5,427 and 4,514, respectively). 70% of customers have a credit card. The numbers of loyal and disloyal clients are almost the same, with slightly more loyal clients. About 20% of the bank's customers had left at the time of the research.

Before looking at the correlation of features, we'll transform the categorical features of gender and city into a suitable form using one-hot encoding.

We will also apply one-hot encoding to the products column: since our goal is to segment the bank's customers by the number of products, we will check how the number of products correlates with other characteristics.
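One-hot encoding turns each category into its own 0/1 indicator column. A minimal sketch with `pandas.get_dummies`, using toy data and assumed column names:

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions
toy = pd.DataFrame({
    "city": ["Yaroslavl", "Rostov", "Rybinsk", "Yaroslavl"],
    "gender": ["M", "F", "F", "M"],
    "products": [1, 2, 3, 1],
    "churn": [0, 0, 1, 1],
})

# One indicator column per category; columns not listed (churn) pass through unchanged
encoded = pd.get_dummies(toy, columns=["city", "gender", "products"], dtype=int)
print(encoded.columns.tolist())
```

Encoding `products` as categories rather than keeping it as a number lets each product count ("one product", "two products", and so on) be correlated with churn separately.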

Let's build the correlation matrix of the features.
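A sketch of this step on toy, already-encoded data (the feature names are assumptions). `DataFrame.corr` computes Pearson correlation by default, which only captures linear relationships.

```python
import pandas as pd

# Toy, already-encoded features for illustration only
toy = pd.DataFrame({
    "age":        [25, 31, 38, 44, 52, 60],
    "products_3": [0, 0, 0, 1, 0, 1],
    "loyalty":    [1, 1, 0, 1, 0, 0],
    "churn":      [0, 0, 0, 1, 1, 1],
})

corr = toy.corr()  # Pearson correlation matrix

# Correlation of every feature with churn, strongest (by absolute value) first
churn_corr = corr["churn"].drop("churn")
churn_corr = churn_corr.reindex(churn_corr.abs().sort_values(ascending=False).index)
print(churn_corr.round(2))
```

Sorting by absolute value puts strong inverse correlations next to strong direct ones, which is convenient when reading the churn row of a heat map.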

The heat map shows that the features are almost uncorrelated. Perhaps the features have non-linear relationships. Let's check the correlation between the features using the Phik correlation coefficient and build a heat map.

Let's look at the churn. There is a moderate positive correlation with age (the highest value among all features), as well as with the "3 products" feature.

In general, Rostov differs from Yaroslavl and Rybinsk in its correlation with the "churn" feature.

Let's build a scatter plot matrix based on the main quantitative features and add a color designation depending on the value of the churn column.

We can say that the clients who left have the same distribution of credit scoring, balance and salary as those who remained, but the median age of those who left is higher: 45 years against 36 for the remaining clients. This may be because the churned clients used only one product (for example, they took out a loan and paid it off): when the need for it disappeared, such customers left.

It is also noticeable that a significant proportion of the churned customers are aged from 40 to 65 years. After 65, their share decreases.

Conclusions:

"Churn" feature is directly correlated with age, as well as with "three products". There is a weak direct correlation with the "four products" feature and the city of Rostov, as well as a weak inverse correlation with loyalty, the "two products" feature, the city of Yaroslavl and gender.

For clients who left, the distribution of credit scoring, balance and salary is the same as for the remaining ones, but the median age of those who left is higher: 45 years versus 36 for the remaining clients. This may be because the churned clients used only one product (for example, they took out a loan and paid it off): when the need for it disappeared, such clients left.

A significant proportion of the churn clients are aged from 40 to 65 years. After 65, their share decreases.

User Segmentation

We divide users into segments by the number of products, by age, and also by bank branch.

Segmentation by number of products

From the data analysis, we know that most clients use one (5,053) or two (4,562) bank products. A very small proportion of clients use three (266) or four (60) products.

Also, according to the correlation matrix, the "three products" and "four products" features are similar to each other and differ from the other features (one and two products).

Therefore, we will divide customers into three segments according to the number of products they use:

  1. one product,
  2. two products,
  3. three or four.
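The three segments listed above can be derived from the product count with a small mapping function. This is a sketch on toy data; the column name `products` and the segment labels are assumptions.

```python
import pandas as pd

def product_segment(n_products: int) -> str:
    """Map a product count to one of the three segments listed above."""
    if n_products == 1:
        return "one product"
    if n_products == 2:
        return "two products"
    return "three or four products"

# Toy data for illustration only
toy = pd.DataFrame({"products": [1, 2, 1, 3, 4, 2, 1]})
toy["product_segment"] = toy["products"].map(product_segment)

# Segment sizes as shares of all clients
shares = toy["product_segment"].value_counts(normalize=True)
print(shares.round(3))
```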

Let's see how, in general, the bank's customers are divided into segments depending on the number of products.

Let's look at boxplots of credit scoring, age, number of objects owned, balance and salary for each product segment.
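The numbers a boxplot displays (the box edges and the median) can also be read off directly as per-segment quartiles. A sketch on toy data, with assumed column names `product_segment` and `objects`:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "product_segment": ["one"] * 5 + ["two"] * 5,
    "objects": [1, 2, 5, 7, 9, 3, 4, 5, 7, 8],
})

# 25th, 50th and 75th percentiles per segment: the box edges and median of a boxplot
quartiles = (
    toy.groupby("product_segment")["objects"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Statements of the form "half of the clients have from 2 to 7 objects, 25% have 7 or more" correspond to the 25th and 75th percentile columns of such a table.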

On the boxplots, we see that the segments practically do not differ in such features as credit scoring, account balance and salary.

There are differences in age: the "youngest" segment is customers with two products, the "oldest" is customers with three and four. Median age of clients: 38 years with one product, 36 with two, and 43 with three or four.

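Also, with the same median number of objects owned by clients of all segments, the middle half of clients spans different ranges: from 2 to 7 objects with one product, from 3 to 7 with two, and from 3 to 8 with three or four.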

Let's see how these three segments are distributed across cities.

In Yaroslavl and Rybinsk, the segment shares are almost the same. In Rostov, the one-product and 3+ segments are slightly larger, and the two-product segment is smaller.

Let's look at the percentage of customers in each segment according to the following features: credit card availability, loyalty, churn.
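Because these features are coded as 0/1, the mean of each column within a segment is the share of clients with the value 1. A sketch on toy data (column names are assumptions):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "product_segment": ["one", "one", "one", "one", "two", "two", "two", "two"],
    "credit_card": [1, 1, 1, 0, 1, 1, 0, 1],
    "loyalty":     [1, 0, 1, 0, 1, 1, 0, 1],
    "churn":       [1, 0, 0, 0, 0, 0, 0, 0],
})

# Mean of a 0/1 column = share of ones; multiply by 100 to get percentages
shares = toy.groupby("product_segment")[["credit_card", "loyalty", "churn"]].mean() * 100
print(shares)
```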

The shares of clients with and without a credit card are almost equal in each segment: 70% of customers have a credit card.

The share of loyal customers is highest among those who use two products (53%) and lowest among customers with 3+ products (44%).

The share of churned customers is highest in the segment that uses three and four products (86%) and lowest among those with two products (8%).

Conclusion:

We have divided customers into three segments depending on the number of products they use. The segments practically do not differ in credit scoring, balance and salary, and 70% of customers in each segment have a credit card. Let's describe the other features.

  1. One product. These customers are the majority: 50.8%. The median age is 38 years. Half of the clients own from 2 to 7 objects, and 25% own 7 or more. Rostov has the largest share of customers with one product (54%). The numbers of loyal and disloyal customers are equal. The share of churned customers is 28%.
  2. Two products. Customers with two products make up 45.9%. The median age is 36 years. Half of the clients own from 3 to 7 objects, and 25% own 7 or more. Rostov has the lowest share of customers with two products (41%). This segment has the highest share of loyal customers (53%) and the lowest share of churned customers (only 8%).
  3. Three and four products. This is the smallest segment (3.28% of all the bank's customers). The median age is 43 years. Half of the clients own from 3 to 8 objects, and 25% own 8 or more. In Rostov, the share of such customers is higher: 5%. This segment has the lowest share of loyal customers (44%) and the highest share of churned customers (86%).

Segmentation by age

From the EDA, we know that a significant proportion of churned customers are aged 40 to 65 years. After 65, their share decreases.

Therefore, we will divide customers into three segments by age:

  1. 18-40 years old,
  2. 40-65 years old,
  3. 65+ years old.
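Age bands like these can be built with `pandas.cut`. The list above does not say which segment the boundary ages 40 and 65 belong to, so the sketch assumes left-closed intervals; the column name `age` is also an assumption.

```python
import numpy as np
import pandas as pd

# Toy ages for illustration only, including the boundary values
toy = pd.DataFrame({"age": [18, 39, 40, 64, 65, 80]})

toy["age_segment"] = pd.cut(
    toy["age"],
    bins=[18, 40, 65, np.inf],
    labels=["18-40", "40-65", "65+"],
    right=False,  # assumption: intervals are [18, 40), [40, 65), [65, inf)
)
print(toy)
```

With `right=False`, a 40-year-old client falls into "40-65" and a 65-year-old into "65+"; switching to `right=True` would move both into the younger band, so the choice should match how the segments are meant to be read.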

Let's see how, in general, the bank's customers were divided into segments depending on age.

The largest segment is customers aged 18-40 (almost 60%); the 40-65 segment holds 37%, and the smallest segment, 65+, about 3%.

Let's look at the boxplots of credit scoring, the number of objects owned, balance and salary by age segments.

On the boxplots, we see that the segments practically do not differ in salaries (we will consider this a feature of the training dataset).

The credit scoring is higher for the "65+" segment.

With the same median balance in all three segments, "65+" has a narrower range: from 50 to 187 thousand rubles. Segments "18-40" and "40-65" have outliers both up and down.

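Also, with the same median number of objects owned by customers of all segments, the middle half of clients spans different ranges: from 3 to 7 objects in "18-40", from 2 to 8 in "40-65", and from 3 to 8 in "65+".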

Let's look at how the age segments are distributed in each city where the bank is present.
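A cross-tabulation normalised by row gives the share of each segment within each city. A sketch on toy data, with assumed column names `city` and `age_segment`:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "city": ["Yaroslavl", "Yaroslavl", "Yaroslavl", "Yaroslavl",
             "Rostov", "Rostov", "Rybinsk", "Rybinsk"],
    "age_segment": ["18-40", "18-40", "40-65", "65+",
                    "18-40", "40-65", "18-40", "18-40"],
})

# normalize="index" turns counts into within-city shares, so each row sums to 1
by_city = pd.crosstab(toy["city"], toy["age_segment"], normalize="index")
print(by_city)
```

Normalising within each city matters here because Yaroslavl has about twice as many clients as the other two cities, so raw counts would not be comparable.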

Rostov differs from Yaroslavl and Rybinsk in the distribution of age segments: its "18-40" segment is smaller (55% vs. 61% and 62%), and its "40-65" segment is larger (43% vs. 37% and 35%).

Let's see how many products are used by customers in each age segment.

Let's look at the percentage of customers in each segment according to the following features: credit card availability, loyalty, churn.

Conclusions:

Based on the results of EDA, we divided the clients into three age segments. The segments practically do not differ in salary (a feature of the dataset), about 70% of customers in each segment have a credit card. Let's describe the other features.

  1. 18-40 years old. This segment is the largest (59.9%). Half of the clients own from 3 to 7 objects, and 25% own 7 or more. Half of the customers use two products, 48% one, and only 2% three and four. The numbers of loyal and disloyal customers are equal, while the proportion of churned customers is the lowest (10%).

  2. 40-65 years old. This segment accounts for 37.3% of all customers. Half of the clients own from 2 to 8 objects, and 25% own 8 or more. 56% of customers use one product, 39% two, and 5.5% three and four. The numbers of loyal and disloyal customers are approximately equal, and the proportion of churned customers is the highest (37%).

  3. 65+ years old. The smallest segment: 2.82% of all the bank's customers. Customers in this segment are more reliable: the median credit score is 660 points. Half of the clients own from 3 to 8 objects, and 25% own 8 or more. With the same median balance in all three segments, "65+" has a narrower range: from 50 to 187 thousand rubles. 52% of customers have one product, 45% have two, and 3.6% have three and four. This segment has the highest share of loyal customers (86%); 15% of its customers churned.

Segmentation by branch

At the stage of EDA, as well as user segmentation by product and age, we paid attention to the following:

We can conclude that, in general, the clients of the Rostov branch differ in behavior from customers in Yaroslavl and Rybinsk. We will divide customers into segments by branch to look at the differences:

  1. Yaroslavl,
  2. Rybinsk,
  3. Rostov.

Yaroslavl (the regional center of Yaroslavl region) accounts for half of the bank's customers; Rybinsk and Rostov account for about 25% each.

Let's look at the boxplots of credit scoring, the number of objects owned, balance and salary by branch.

On the boxplots, we see that the salaries of clients in different branches practically do not differ.

Clients in Rostov have a higher median age: 38 years (37 in Yaroslavl and Rybinsk), and 25% of clients are over 45 years old (over 43 and 44 in Yaroslavl and Rybinsk, respectively).

With the same median balance in all three branches, the range is narrower in Rostov: 50% of customers have a balance from 103 to 138 thousand rubles.

Also, with the same median number of objects owned by customers of all branches,

Let's look at the percentage of clients in each branch according to the following features: credit card availability, loyalty, churn.

Rostov differs from the other branches in the "churn" feature: the share of churned customers in this branch (32%) is twice as high as in the others.

Such a high churn level is due to the fact that:

Such features of this particular city are most likely related to the demographic situation. According to Rosstat (Russian Federal State Statistics Service), as of 2022, the population of Rostov is 30 thousand people, and Rybinsk is 182 thousand. At the same time, the number of bank customers in these two cities is almost the same.

Clients from Rostov are older on average, because, most likely, young people leave for larger cities. At the same time, the city has the highest proportion of bank customers — 8% of residents, which can be explained by low competition among banks in the city, since it is relatively small.
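The 8% figure can be roughly reproduced from numbers quoted in this report. The sketch assumes that Rostov and Rybinsk each hold about 25% of the 9,941 remaining clients; the exact per-city counts are not shown here, so the result is an approximation.

```python
TOTAL_CLIENTS = 9_941  # records left after preprocessing
CITY_SHARE = 0.25      # assumption: approximate share of clients in each of the two cities
POPULATION = {"Rostov": 30_000, "Rybinsk": 182_000}  # Rosstat figures quoted above

clients_per_city = TOTAL_CLIENTS * CITY_SHARE

# Share of each city's residents who are clients of the bank
penetration = {city: clients_per_city / pop for city, pop in POPULATION.items()}

for city, share in penetration.items():
    print(f"{city}: {share:.1%} of residents are clients")
```

Under these assumptions, roughly 8.3% of Rostov residents are clients, against about 1.4% in Rybinsk.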

Hypothesis Testing

We are testing two hypotheses:

  1. There are statistically significant differences in the average income of customers who use two bank products, and those who use one.
  2. There are statistically significant differences in the average value of the balance of customers who use two bank products, and those who use one.

We are testing the hypothesis about the difference in the average value of income for customers who use one and two products

In this case, we will test the hypothesis about the equality of the averages of the two general populations. We formulate the null and alternative hypotheses.

Null hypothesis: the average income values of customers who use two products of the bank and those who use one are equal.

Alternative hypothesis: the average income values of customers who use two products of the bank, and those who use one, are not equal.
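The report does not state which test or significance level was used. One common choice for comparing the means of two independent samples is Welch's t-test, sketched below with `scipy.stats.ttest_ind`; the 0.05 threshold, the helper name and the toy samples are assumptions.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level; the report does not state one

def means_differ(sample_a, sample_b, alpha=ALPHA):
    """Two-sided Welch's t-test; returns (p_value, reject_null)."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # Drop missing values so that columns with gaps (such as balance) can be passed in
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]
    result = stats.ttest_ind(a, b, equal_var=False)  # equal_var=False selects Welch's test
    return float(result.pvalue), bool(result.pvalue < alpha)

# Toy incomes in rubles, for illustration only
one_product = [52_000, 98_000, 101_000, 143_000, 75_000]
two_products = [49_000, 102_000, 97_000, 150_000, 76_000]

p_value, reject = means_differ(one_product, two_products)
print(f"p-value = {p_value:.3f}, reject H0: {reject}")
```

Welch's variant does not assume equal variances in the two groups. The same helper can be applied to the balance comparison, since it drops missing values before testing.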

We cannot reject the null hypothesis: there is no statistically significant difference in the average income of customers who use one and two products.

We are testing the hypothesis about the difference in the average balance value of customers who use one and two products

In this case, we will also test the hypothesis of the equality of the averages of the two general populations. We formulate the null and alternative hypotheses.

Null hypothesis: the average balance values of customers who use two bank products and those who use one are equal.

Alternative hypothesis: the average balance values of customers who use two products of the bank, and those who use one, are not equal.

We cannot reject the null hypothesis: there is no statistically significant difference in the average balance of customers who use one and two products.

General Conclusion and Recommendations

1. Study and preprocessing

There are 10,000 records in the dataset. There are no duplicates. The dataset was processed as follows:

There are 9,941 records left in the dataset after processing.

2. Exploratory Data Analysis

"Churn" feature is directly correlated with age, as well as with the "three products" feature. There is a weak direct correlation with the "four products" feature and the city of Rostov. The median age of the churned clients is higher: 45 years versus 36 for the remaining clients. A significant proportion of the churned customers are aged from 40 to 65 years. After 65, their share decreases.

3. Customer segmentation

  1. According to the number of products, customers were divided into three segments:

The segments practically do not differ in such features as credit scoring, balance and salary, 70% of customers in each segment have a credit card.

  1. One product. These clients are the majority: 50.8%. The median age is 38 years. Half of the clients own from 2 to 7 objects, and 25% own 7 or more. The numbers of loyal and disloyal customers are equal. The share of churned customers is 28%.

  2. Two products. Clients with two products make up 45.9%. The median age is 36 years. Half of the clients own from 3 to 7 objects, and 25% own 7 or more. This segment has the highest share of loyal customers (53%) and the lowest share of churned customers (only 8%).

  3. Three and four products. The smallest segment: only 3.28% of all the bank's customers. The median age is 43 years. Half of the clients own from 3 to 8 objects, and 25% own 8 or more. This segment has the lowest share of loyal customers (44%) and the highest share of churned clients (86%).

  2. By age, customers were divided into three segments:

  1. 18-40 years old. These clients are the majority: 59.9%. Half of the clients own from 3 to 7 objects, and 25% own 7 or more. Half of the customers use two products, 48% one, and only 2% three and four. The numbers of loyal and disloyal customers are equal, and the proportion of churned customers is the lowest (10%).

  2. 40-65 years old. This segment accounts for 37.3% of all customers. Half of the clients own from 2 to 8 objects, and 25% own 8 or more. 56% of customers use one product, 39% two, and 5.5% three and four. The numbers of loyal and disloyal customers are approximately equal, and the proportion of churned customers is the highest (37%).

  3. 65+ years old. The smallest segment: 2.82% of all the bank's customers. Clients in this segment are more reliable: the median credit score is 660 points. Half of the clients own from 3 to 8 objects, and 25% own 8 or more. 52% of customers have one product, 45% have two, and 3.6% have three and four. This segment has the highest share of loyal customers (86%); 15% of its customers churned.

We also concluded that, in general, the customers of the branch in Rostov differ in behavior from customers in Yaroslavl and Rybinsk, and studied the differences.

Rostov differs from the other branches in the "churn" feature: the share of churned customers in this branch (32%) is twice as high as in the others.

Such a high churn rate is due to the fact that:

Such features of this particular city are most likely related to the demographic situation. According to Rosstat (Russian Federal State Statistics Service), as of 2022, the population of Rostov is 30 thousand people, and Rybinsk is 182 thousand. At the same time, the number of bank clients in these two cities is almost the same.

Clients from Rostov are older on average, because, most likely, young people leave for larger cities. At the same time, the city has the highest proportion of bank customers — 8% of residents, which can be explained by low competition among banks in the city, since it is relatively small.

4. Hypothesis testing

We tested two hypotheses:

  1. There are statistically significant differences in the average income of customers who use two products of the bank, and those who use one.
  2. There are statistically significant differences in the average value of the balance of customers who use two products of the bank, and those who use one.

Neither hypothesis was confirmed: there were no statistically significant differences in the average income or the average balance of customers with one and two products.

5. Recommendations

  1. Develop a product aimed at customers aged 40-65 years: more than half of these customers use only one product (most likely a loan), and this segment has the highest churn rate (they paid off the loan and left). Such a product could be a debit card with cashback, a reduced-rate loan for children's education, or a mortgage loan.

  2. Offer a new product to older customers from Rostov, for example, increased deposit rates for retired people. This will help to reduce churn, because the "65+" age segment has the highest loyalty.